
Yash/dev llava next #10749

Open · wants to merge 94 commits into base: main
Conversation

yashaswikarnati (Collaborator)
What does this PR do?

Adds support for training the LLaVA NeXT model.

Collection: Multimodal

Changelog

  • Added the task encoders the energon data module needs to support training LLaVA NeXT with the NeVA model

Usage

The only change is to use the energon-based data module, shown below, with the existing NeVA model:

from transformers import AutoProcessor

from nemo.collections.multimodal.data.energon import SimpleMultiModalDataModule
from nemo.collections.multimodal.data.energon.config import MultiModalSampleConfig
from nemo.collections.vlm import LlavaNextTaskEncoder

# Load the HF processor for LLaVA NeXT and reuse its tokenizer and image processor.
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
tokenizer = processor.tokenizer
image_processor = processor.image_processor

data_path = args.data_path  # path to the energon-prepared dataset (from the script's argparse)

multimodal_sample_config = MultiModalSampleConfig()

# Task encoder that converts energon samples into LLaVA NeXT training inputs.
task_encoder = LlavaNextTaskEncoder(
    tokenizer=tokenizer,
    image_processor=image_processor,
    multimodal_sample_config=multimodal_sample_config,
)
data = SimpleMultiModalDataModule(
    path=data_path,
    tokenizer=tokenizer,
    image_processor=image_processor,
    num_workers=8,
    micro_batch_size=mbs,  # micro batch size, e.g. 4
    global_batch_size=gbs,  # global batch size, e.g. 128
    multimodal_sample_config=multimodal_sample_config,
    task_encoder=task_encoder,
)
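For context, the data module built above is consumed by a NeVA/LLaVA fine-tuning run in the usual NeMo 2.0 way. The sketch below is illustrative only and is not code from this PR: the model class and config (LlavaModel, Llava1_5Config7B) are taken from the example script's imports as placeholders, and the trainer settings are assumptions; adapt them to the actual LLaVA NeXT recipe.

import nemo.lightning as nl
from nemo.collections import vlm

# Placeholder model; substitute the LLaVA NeXT / NeVA config used by the example script.
model = vlm.LlavaModel(vlm.Llava1_5Config7B())

# Illustrative trainer settings; parallelism, precision, and step count are assumptions.
trainer = nl.Trainer(
    devices=8,
    accelerator="gpu",
    strategy=nl.MegatronStrategy(tensor_model_parallel_size=4),
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
    max_steps=1000,
)

# `data` is the SimpleMultiModalDataModule built above.
trainer.fit(model, data)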

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
    • Reviewer: Does the PR have correct import guards for all optional libraries? (A typical guard pattern is sketched after this list.)
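For reference, a typical import guard for an optional dependency looks like the following. This is a generic illustration, not code from this PR; the numba dependency and the fast_resample function are hypothetical.

# Generic import-guard pattern for an optional dependency (illustrative only).
try:
    import numba  # optional dependency; only needed for the accelerated path

    HAVE_NUMBA = True
except (ImportError, ModuleNotFoundError):
    HAVE_NUMBA = False


def fast_resample(x):
    # Fail with a clear message instead of crashing at import time.
    if not HAVE_NUMBA:
        raise ModuleNotFoundError("numba is required for fast_resample; install it or use the non-accelerated path")
    ...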

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@yashaswikarnati (Collaborator, Author)

Fine-tuning convergence run on the LLaVA dataset: https://wandb.ai/joc/llava_next_energon/runs/70551i0j?nw=nwuserykarnati

Code scanning / CodeQL notices on the newly added example scripts:

  • Unused import: 'os'
  • Unused import: 'sys'
  • Unused import: 'requests'
  • Unused import: 'VQASample'
  • Unused import: 'Image'
  • Unused imports: 'ImageDataConfig', 'Llava1_5Config7B', 'LlavaModel'
  • Unused import: 'logging'
  • Unused local variable: 'seq_length' (where gbs = 32, mbs = 4, seq_length = 256)
  • Unused import: 'ImageDataConfig'
  • Unused local variable: 'seq_length' (where gbs = 128, mbs = 4, seq_length = 4096)
akoumpa and others added 22 commits on October 24, 2024

  • add LinearAdapter; add hf lora example; remove unused imports; fixes; subclass mixin; remove stale imports; fix scale; regex selector for peft; move lora; fmt; hf_auto_model_for_causal_lm finetune recipe; apply isort and black reformatting (Alexandros Koumparoulis)
  • Add a build option to load_context; add test; try to fix failing CPU test; cherry-pick fix (Marc Romeijn, Alexandros Koumparoulis)
  • Move AutoTokenizer inline; move einops to common requirements; move AutoTokenizer import to top-level again in fine_tuning; move megatron init inside nemo.lightning; make megatron_lazy_init_context work when transformer-engine is not installed; only import get_nmt_tokenizer when needed; apply isort and black reformatting (Marc Romeyn)
  • add docs; update doc and fix missing param (stevehuang52)
  • Enable ckpt features by default (async ckpt), checkpoint every 15 min, and reduce preemption time to 1 min; fix ssm tests; note that ckpt_async_save is disabled for SSMs, then enable async ckpt for SSMs with a fix; disable async ckpt in the peft test (known bug) and add a note; fix failing unit tests; Ashors/peft async ckpt (#11010): prototype async checkpointing with peft, enable async ckpt for the peft test, fix peft setup test (Shriya Palsamudram, ashors1, ataghibakhsh)
  • Mixtral TP8 EP1; apply isort and black reformatting (Alexandros Koumparoulis)

Signed-off-by: yashaswikarnati <[email protected]>
Co-authored-by: pablo-garay <[email protected]>